DESeq2ENSEMBL to the official gene symbolmap)
DESeq2
library.DESeq2 analysisDESeq2The upstream analysis will include the analysis from downloading fastq files to gene count. Each step of the analysis will be explained in detail in each section. The nf-core pipelines were used for all processes in this upstream analysis according to the reproducibility concern. Please see the detail of nf-core setting in my note for computer workbench.
Transcriptome analysis starts with fastq files as input files and
pass through 5 different processes including
### 1. Pre-alignment quality control (QC): To test the quanlity of fastq
file, the tools such as FastQC
Input = .fastq files
Output = the same .fastq files without any editing
Input = .fastq files
Output = .fastq files which adapter was cut.
#### 3.1 An organism with a reference genome: It is possible to infer
which transcripts are expressed by mapping the reads to the reference
genome (genome mapping) or transcriptome (transcriptome mapping).
To understand the different usage between Genome and Transcriptome
mapping, you need to firstly back to central dogma as picture showed
below. RNA-seq reads can be aligned directly to reference transcriptome.
However, transcriptome mapping is not suitable for discovery of novel
isoforms and splice junctions. Therefore, it depend on research
question.
#### 3.2 An organism without a reference genome: The reads need to be
assembled first into longer contigs (de novo assembly).
In my case, I want to quantify gene expression of Human without
concerning about novel isoforms and slice junction. Therefore, I will
use at least transcriptome mapping.
I choose STAR as my tool for alignment process. Mapping
will be done on genome/transciptome and we also want to know whether
which location(gene) it map to. Therefore, the additional files needs
for the alignment process are reference genome and gene annotation,
accordingly.
If we manually perform this alignment process using
STAR, we need 2 main steps including
- Generating a genome index: using --runMode genomeGenerate
- Reference genome alignment: The genome index from previous step will
be used as one of the inputs for this step.
Therefore, input and output for STAR tool are - Input = sample files
(.fastq files), reference genome (fasta file),
gene annotation (GTF)
- Output = SAM or BAM files which adapter was
cut.
The mapped reads will next be quantified the gene expression. I will
use RSEM for this process. If we manually perform this
quantification process using RSEM, we need 2 main steps
including
- Reference preparation: using rsem-prepare-reference -
Expression calculation: The RSEM reference from previous step will be
used as one of the inputs for this step using
rsem-calculate-expression
Therefore, input and output for RSEM tool are
- Input = mapped read from sample (BAM files), reference
genome (fasta file), gene annotation
(GTF)
- Output = Gene-level results (rsem.genes.results),
Isoform-level results (rsem.isoforms.results)
Here RSeQC will be used and we will observe the QC
result from all post-aligned files using MultiQC which will
provide as a .html file.
fastq files from GEOThe input file for upstream analysis of transcriptomics is the fastq
files. For this example, I will use the data from GSE197576.
The data from the GSE includes all data under the study, which we don’t
want all of those. Here, I want to compare normoxic and hypoxic
condition. Therefore, I includ only control cell sgCTRL
from hypoxic (GSM5920765,
GSM5920766)
and normoxic (GSM5920759,
GSM5920760)
condition.
By using nf-core fetchngs pipeline, the default
(FTP) will be used.
To download those mentioned fastq files, we have to create a file
named ids.csv where contains all GSM numbers
that we want to download.
# Go to directory where `ids.csv` file was saved
cd data/fetchngs
# Oberving the `ids.csv` file
cat ids.csv
Next, I define the needed information in nf-core/fetchngs
launch pipeline. After launch the workflow, I got
nf-params.json file. I download the file to run it
locally.
Here is my nf-params.json. Please note that we will run all
process on High Performance Computing (HPC), thus the file paths are
full path of HPC.
# Go to directory where the `nf-params.json file was saved
cd data/fetchngs
# Observing the `nf-params.json` file
cat nf-params.json
Now, All information is ready to perfrom
nf-core/fetchngs to download 4 fastq files. To start
nf-core/fetchngs process, I run the following command.
nextflow run nf-core/fetchngs -r 1.12.0 -name Hypox -params-file nf-params.json -profile docker
The process will start and show the picture below. Moreover, it will real-time update the status of each process.
After pipeline finished successfully, We will find these 4 fastq files as shown below.
As I mentioned above that aligning on reference genome/transcriptome
need reference genome and gene annotation.
Here, we will download the latest version of those two using the command
below.
# Find the latest version
latest_release=$(curl -s 'http://rest.ensembl.org/info/software?content-type=application/json' | grep -o '"release":[0-9]*' | cut -d: -f2)
# Download the Human (Homo_sapiens.GRCh38) reference genome
wget -L ftp://ftp.ensembl.org/pub/release-${latest_release}/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
# Download the Human (Homo_sapiens.GRCh38)gene annotation
wget -L ftp://ftp.ensembl.org/pub/release-${latest_release}/gtf/homo_sapiens/Homo_sapiens.GRCh38.${latest_release}.gtf.gz
The upstream analysis was performed on HPC using a nf-core pipeline
called rnaseq.
From nf-core/rnaseq pipeline, we will mainly use the
dark blue pipeline where Trim Galora, STAR,
and RSEM will be used for trimming, aligning, and
quantification, respectively. I used those tools as mentioned above
according to the reccomendation in STAT115 course which thought by
Prof. Xiaole Shirley Liu, Harvard university
To analyze a set of .fastq files, we have to create a
file named samplesheet.csv where includes all information
about our .fastq file.
In my case, my data is single-end sequencing read. Thus, I have only one
fastq file per each replicate of each sample.
# Go to directory that save `samplesheet.csv` file
cd data/rnaseq
# Observing the `samplesheet.csv` file
cat samplesheet.csv
Next, I define the needed information in nf-core/rnaseq
launch pipeline. After launch the workflow, I got
nf-params.json file. I download the file to run the
analysis it locally.
Here is my nf-params.json. Please note that we will run all
process on High Performance Computing (HPC), thus the file paths are
full path of HPC.
# Go to directory where the `nf-params.json file was saved
cd data/rnaseq
# Observing the `nf-params.json` file
cat nf-params.json
Now, All information is ready to perfrom nf-core/rnaseq
to analyze from 4 fastq files to gene count. To start
nf-core/rnaseq process, I run the following command.
nextflow run nf-core/rnaseq -r 3.14.0 -name HYPOX -resume -params-file nf-params.json -profile docker
The process will start and show the picture below. Moreover, it will
real-time update the status of each process.
We can observe QC from both pre and post-alignment and many more
visualization created by the tools provided by
nf-core/rnaseq pipeline from MultiQC.html
file.
dplyr and readr are the needed R library in
this section, so I called it.
library(dplyr)
library(readr)
raw_counts <- read_tsv("/home/rstudio/R_docker/port_RNA/MasteringR/From_HPC/STAR_RSEM/rsem.merged.gene_counts.tsv")
# observing head of the data
head(raw_counts)
# Convert all values in `raw_counts_mat` to integer according to DESeq2 input request.
# Use apply to convert columns 3 to 6 to integers
raw_counts[, 3:6] <- apply(raw_counts[, 3:6], 2, as.integer)
#raw_counts_r$ENSEMBL <- raw_counts$gene_id
#raw_counts <- raw_counts_r
# Assuming spc_tbl_ is your tibble/data frame
# Convert numeric columns to integers
spc_tbl_ <- raw_counts %>%
mutate(across(starts_with("sgCTRL"), as.integer)) # Change this to match your column names if needed
# Print the modified data frame
print(spc_tbl_)
ENSEMBL to the official gene symbolFrom column gene_id, it provide ue the ENSEMBL ID, we
next need to convert it to readable pattern liked gene symbol using the
org.Hs.eg.db Bioconductor package
library(org.Hs.eg.db)
map<- AnnotationDbi::select(org.Hs.eg.db,
keys = raw_counts$gene_id, # column that contain key (we want to convert this col.)
keytype = "ENSEMBL", # type of key in the column,
columns= "SYMBOL")# What we want to retrieve
head(map)
If you are not sure about how to call each type of data in
keytype and columns, you can observe how each
type was named using the following R code.
columns(org.Hs.eg.db)
keytypes(org.Hs.eg.db)
map)We next want the raw_counts with the gene symbol. To do
this, we join raw_counts with map by matching
the ENSEMBL. To join those two together, we need a share column which is
ENSEMBL in this case. Therefore, we have to check the
column name.
library(dplyr)
# Change the column name to `ENSEMBL`
colnames(raw_counts)[colnames(raw_counts) == "gene_id"] <- "ENSEMBL"
# Map `raw_counts` with `map`
raw_counts_map <- left_join(raw_counts, map)
head(raw_counts_map)
# Write `map` as .csv file
write.csv(raw_counts_map, file = "result/HypoxUpstream_raw_counts_map.csv", row.names = TRUE)
We will delete gene column first, then put it back
later
# Delete the `transcript_id`, `ENSEMBL`, and `gene` column and change all gene count columns to matrix.
raw_counts_mat <- raw_counts_map[, -c(1, 2, 7)] %>%
as.matrix()
# Put the `gene` column back.
rownames(raw_counts_mat) <- raw_counts_map$SYMBOL
head(raw_counts_mat)
# Write `map_mat` as .csv file
write.csv(raw_counts_mat, file = "result/HypoxUpstream_raw_counts_mat.csv", row.names = TRUE)
I will subset only the common genes (common_genes) which
are the genes that found both in our data raw_counts_mat
and gene symbol annotation map$SYMBOL.
common_genes <- intersect(rownames(raw_counts_mat), map$SYMBOL)
head(common_genes)
Here is map which contains ENTREZID, length, and gene
symbol information that intersect and re-order according to
common_genes.
# select only the common genes and re-order them by common_genes
map<- map %>%
dplyr::slice(match(common_genes, SYMBOL))
head(map)
Here is raw_count_mat which contains gene symbol and
gene count information that intersect and re-order according to
common_genes.
# subset the common genes and re-order them by common_genes
#raw_counts_mat1 <- raw_counts_mat[common_genes, ]
#raw_counts_mat_subset <- raw_counts_mat[rownames(raw_counts_mat) %in% common_genes, ]
# select only the common genes and re-order them by common_genes
#raw_counts_mat <- raw_counts_mat %>% dplyr::slice(match(common_genes, SYMBOL))
# Use match() and direct indexing
#raw_counts_mat <- raw_counts_mat[, match(common_genes)]
raw_counts_df <- as.data.frame(raw_counts_mat)
raw_counts_subset <- raw_counts_df %>% subset(rownames(raw_counts_df) %in% common_genes) %>% as.matrix()
head(raw_counts_subset)
DESeq2 library.library(DESeq2)
DESeq2 analysis# Define condition of each column
coldata <- data.frame(condition = c("hypoxia", "hypoxia", "normoxia", "normoxia"))
rownames(coldata) <- colnames(raw_counts_mat)
# Recheck whether the name is match with the defined condition
coldata
DESeq2This start by create a DESeq2 object (dds) where
raw_counts_mat and coldata were used as input.
This object will used condition as the main factor.
dds <- DESeqDataSetFromMatrix(countData = raw_counts_subset,
colData = coldata,
design = ~ condition)
dds <- DESeq(dds)
dds
To get differential gene expression, the two conditions (normoxia vs hypoxia) will be used.
res <- results(dds, contrast = c("condition", "hypoxia", "normoxia"))
res
The volcano plot will label the most differential express genes
either upregulation or downregulation. Thus, I firstly define genes that
we will labelgenes_to_label.
genes_to_label <- res %>%
as.data.frame() %>%
tibble::rownames_to_column(var = "gene") %>%
filter(!stringr::str_detect(gene, "LOC"),
abs(log2FoldChange) >= 2.5,
padj <= 0.001)
head(genes_to_label)
Moreover, I also want to label the significantly different genes by
color. Therefore, the significant(sig) and not
significant(not_sig) genes will be defined by the code
below for the labeling purpose.
res_label <- res %>%
as.data.frame() %>%
tibble::rownames_to_column(var = "gene") %>%
mutate(sig = case_when(
!stringr::str_detect(gene, "LOC") &
abs(log2FoldChange) >= 2.5 &
padj <= 0.001 ~ "sig",
TRUE ~ "not_sig"
))
head(res_label)
Now, I create the volcano plot using ggplot2 library and
use ggrepel to prevent label overlap. Additionally, I add
horizontal and vertical lines to mark the highly significant level based
on p-values and fold change, respectively.
library(ggplot2)
library(ggrepel)
ggplot(res_label, aes(x = log2FoldChange, y = -log10(pvalue))) +
geom_point(aes(color = sig)) +
scale_color_manual(values = c("blue", "red")) +
ggrepel::geom_label_repel(data = genes_to_label, aes(label = gene))+
geom_hline(yintercept = 100, linetype = 2, color = "red") +
geom_vline(xintercept = c(-2.5, 2.5), linetype = 2, color = "red")+
theme_bw(base_size = 14)
# Calculate the normalized data for PCA plot
vsd <- vst(dds, blind=FALSE)
normalized_counts <- assay(vsd) %>% as.matrix()
# Calculate the principal components
pca_prcomp <- prcomp(t(normalized_counts), center = TRUE, scale. = FALSE)
# This is the PC values
pca_prcomp$x
# Create a DataFrame for visualization
PC1_and_PC2 <- data.frame(PC1 = pca_prcomp$x[, 1], PC2 = pca_prcomp$x[, 2], type = rownames(pca_prcomp$x))
PC1_and_PC2
This will generate the PCA plot.
# Create a PCA plot
ggplot(PC1_and_PC2, aes(x = PC1, y = PC2, col = type)) +
geom_point() +
geom_text(aes(label = type), hjust = 0, vjust = 0) +
coord_fixed()+
theme_classic()
significant_genes <- res %>%
as.data.frame() %>%
filter(padj <= 0.01, abs(log2FoldChange) >= 2) %>%
rownames()
head(significant_genes, 10)
I will use ComplexHeatmap for a heatmap plotting.
library(ComplexHeatmap)
significant_mat <- normalized_counts[significant_genes, ]
col_anno <- HeatmapAnnotation(
df = coldata,
col = list(condition = c("hypoxia" = "red", "normoxia" = "blue"))
)
Heatmap(
t(scale(t(significant_mat))),
top_annotation = col_anno,
show_row_names = FALSE,
name = "scaled normalized\nexpression"
)
Over-Representation Analysis (ORA) is a method that compares the list of genes of interest (e.g., DEGs) to a background set (all genes in the experiment) to identify pathways that are over-represented among the genes of interest. ### 5.1 Conversion of Gene Symbols
library(clusterProfiler)
#convert gene symbol to Entrez ID
significant_genes_map<- clusterProfiler::bitr(geneID = significant_genes,
fromType="SYMBOL", toType="ENTREZID",
OrgDb="org.Hs.eg.db")
head(significant_genes_map)
## background genes are genes that are detected in the RNAseq experiment
background_genes<- res %>%
as.data.frame() %>%
filter(baseMean != 0) %>%
tibble::rownames_to_column(var = "gene") %>%
pull(gene)
res_df<- res %>%
as.data.frame() %>%
filter(baseMean != 0) %>%
tibble::rownames_to_column(var = "gene")
background_genes_map<- bitr(geneID = background_genes,
fromType="SYMBOL",
toType="ENTREZID",
OrgDb="org.Hs.eg.db")
head(background_genes_map)
ego <- enrichGO(gene = significant_genes_map$ENTREZID,
universe = background_genes_map$ENTREZID,
OrgDb = org.Hs.eg.db,
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05,
readable = TRUE)
head(ego)
dotplot(ego)
# Load the msigdbr package
library(msigdbr)
# Retrieve Hallmark gene sets for Homo sapiens
m_df <- msigdbr(species = "Homo sapiens")
# Them_df object now contains information about the Hallmark gene sets.
head(m_df)
Let’s use the msigdbr function with the species parameter set to “Homo sapiens” and the category parameter set to “H” to retrieve Hallmark gene sets.
m_t2g <- msigdbr(species = "Homo sapiens", category = "H") %>%
dplyr::select(gs_name, entrez_gene)
To get an overview of the number of genes in each gene set, we can create a table:
# Count the number of genes in each gene set
table(m_t2g$gs_name)
let’s perform gene set enrichment analysis using these sets.
We’ll use a set of significant genes from our analysis (represented by significant_genes_map$ENTREZID) and the Hallmark gene sets (represented by m_t2g). This analysis will help us identify which Hallmark gene sets are enriched in our significant genes.
em <- enricher(significant_genes_map$ENTREZID, TERM2GENE = m_t2g,
universe = background_genes_map$ENTREZID)
head(em)
dotplot(em)
res_df <- res_df %>%
mutate(signed_rank_stats = sign(log2FoldChange) * -log10(pvalue)) %>%
left_join(background_genes_map, by = c("gene" = "SYMBOL")) %>%
arrange(desc(signed_rank_stats))
res_df <- res_df %>%
mutate(negative_log10pvalue = -log10(pvalue)) %>%
mutate(negative_log10pvalue = ifelse(is.infinite(negative_log10pvalue), 1000, negative_log10pvalue)) %>%
mutate(signed_rank_stats = sign(log2FoldChange) * negative_log10pvalue)
gene_list <- res_df$signed_rank_stats
names(gene_list) <- res_df$ENTREZID
em2 <- GSEA(gene_list, TERM2GENE = m_t2g)
head(em2)
em2@result
# save visualization in p1
p1<- gseaplot(em2, geneSetID = "HALLMARK_G2M_CHECKPOINT",
by = "runningScore", title = "HALLMARK_G2M_CHECKPOINT")
p1
# save visualization in p2
p2 <- gseaplot(em2, geneSetID = "HALLMARK_HYPOXIA",
by = "runningScore", title = "HALLMARK_HYPOXIA")
p2